Long to Wide for multiple time varying covariates*

Brookline Cole

Join Date: Sep 2019
Posts: 11

Long to Wide for multiple time varying covariates*

27 Sep 2019, 16:05

Hi all, I am trying to convert my data from long format to wide. Each patient (ID), has multiple ECGs, multiple echocardiograms and pacemaker checks at different times points. Below is a simplified version of my database. DIAGNOSISDATE, is when a patient (ID) was diagnosed with heart disease. Then had multiple tests done during follow up. I have the DATE of each test, time from diagnosis for each test (in days) and a variable attributable to each test. For instance, for each ECHODATE I measured patients heart volume, and for each PACEMAKER check date I recorded whether it was abnormal or not, etc. Now all these time varying variables have different dates. I assume the value for my measurements does not change in between my measurements so I should be able to carry forward prior observations. I thought if I create a variable that would include ALL PROCEDURE DATES (ECG, ECHO, PACEMAKER), then I can sort by ID and for every mission row carry forward the prior observation for each variable, but I cannot seem to find a way to do that. Is there a better way to do this before reshaping the data from Long to Wide? Sorry if this is a very basic question. Thank you for your help.

ID	DIAGNOSISDATE	ECHODATE_Dx	ECHODATE	HEARTVOL	PACEDATE_Dx	PACEDATE	ABNLPACE	ECGDATE-dx	ECGDATE	ABNLECG
2	4/27/09	4	5/1/09	22.46	1289	11/6/12	0.00	24.00	5/21/2009	1
2	4/27/09	1289	11/6/12	25.46	1456	4/22/13	1.00	441.00	7/12/2010	1
2	4/27/09	2065	12/22/14	26.82	2065	12/22/14	1.00	756.00	5/23/2011	1
2	4/27/09	2464	1/25/16	27.66	3059	9/11/17	1.00	1289.00	11/6/2012	1
2	4/27/09	3213	2/12/18	28.83	3213	2/12/18	0.00	1456.00	4/22/2013	1
2	4/27/09	3724	7/8/19	27.00	3724	7/8/19	0.00	2065.00	12/22/2014	1
2	4/27/09							2464.00	1/25/2016	1
2	4/27/09							2527.00	3/28/2016	1
2	4/27/09							3059.00	9/11/2017	1
2	4/27/09							3216.00	2/15/2018	1
2	4/27/09							3724.00	7/8/2019	1
1	1/8/07	21	1/29/07	18.89	1764	11/7/11	0.00	217.00	8/13/07	1
1	1/8/07	1291	7/22/10	19.25	1852	2/3/12	0.00	469.00	4/21/08	1
1	1/8/07	1680	8/15/11	23.34	2072	9/10/12	0.00	903.00	6/29/09	1
1	1/8/07	2111	10/19/12	22.13	2086	9/24/12	0.00	1107.00	1/19/10	1
1	1/8/07	2283	4/9/13	21.15	2111	10/19/12	0.00	1568.00	4/25/11	1
1	1/8/07	3003	3/30/15	23.41	2123	10/31/12	0.00	2122.00	10/30/12	1
1	1/8/07	3934	10/16/17	19.50	2254	3/11/13	0.00	2312.00	5/8/13	1
1	1/8/07				2282	4/8/13	12.00	3003.00	3/30/15	1
1	1/8/07				2312	5/8/13	0.00	3934.00	10/16/17	1
1	1/8/07				2327	5/23/13	0.00
1	1/8/07				2416	8/20/13	0.00
1	1/8/07				3934	10/16/17	0.00

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30071
#2

27 Sep 2019, 17:08

The above did not come from a Stata data set. It could not have, because hyphens are not legal characters in variable names. If you have not yet imported your data into Stata, then it is premature to ask for help with coding it. If you have, then you should post back with an excerpt of your data that comes from that data set, and you should do it using the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment

Brookline Cole

Join Date: Sep 2019
Posts: 11

30 Sep 2019, 10:16

Thank you so much for explaining and my apologies for posting the tabular form. I am using Stata/SE 16.0. Would greatly appreciate your help.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ID int(DIAGNOSISDATE TTEDATE_Dx ECHODATE) double HEARTVOL int(PACEDIAG PACEDATE) byte ABNLPACE int ECGdatedx str10 ECGDate byte ABNLECG
2 18014    4 18018 22.46 1289 19303  0   24 "5/21/2009"  1
2 18014 1289 19303 25.46 1456 19470  1  441 "7/12/2010"  1
2 18014 2065 20079 26.82 2065 20079  1  756 "5/23/2011"  1
2 18014 2464 20478 27.66 3059 21073  1 1289 "11/6/2012"  1
2 18014 3213 21227 28.83 3213 21227  0 1456 "4/22/2013"  1
2 18014 3724 21738    27 3724 21738  0 2065 "12/22/2014" 1
2 18014    .     .     .    .     .  . 2464 "1/25/2016"  1
2 18014    .     .     .    .     .  . 2527 "3/28/2016"  1
2 18014    .     .     .    .     .  . 3059 "9/11/2017"  1
2 18014    .     .     .    .     .  . 3216 "2/15/2018"  1
2 18014    .     .     .    .     .  . 3724 "7/8/2019"   1
1 17174   21 17195 18.89 1764 18938  0  217 " 8/13/2007" 1
1 17174 1291 18465 19.25 1852 19026  0  469 " 4/21/2008" 1
1 17174 1680 18854 23.34 2072 19246  0  903 " 6/29/2009" 1
1 17174 2111 19285 22.13 2086 19260  0 1107 " 1/19/2010" 1
1 17174 2283 19457 21.15 2111 19285  0 1568 " 4/25/2011" 1
1 17174 3003 20177 23.41 2123 19297  0 2122 "10/30/2012" 1
1 17174 3934 21108  19.5 2254 19428  0 2312 "  5/8/2013" 1
1 17174    .     .     . 2282 19456 12 3003 " 3/30/2015" 1
1 17174    .     .     . 2312 19486  0 3934 "10/16/2017" 1
1 17174    .     .     . 2327 19501  0    . ""           .
1 17174    .     .     . 2416 19590  0    . ""           .
1 17174    .     .     . 3934 21108  0    . ""           .
end
format %tdnn/dd/CCYY DIAGNOSISDATE
format %tdnn/dd/CCYY ECHODATE
format %tdnn/dd/CCYY PACEDATE

Comment

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

30 Sep 2019, 11:17

The last date is string. You will need to transform it in a date variable. That said, I see blank spaces in the beginning of some observations, which would prompt to add some trimming beforehand.

Best regards,

Marcos
Comment

Brookline Cole

Join Date: Sep 2019
Posts: 11

30 Sep 2019, 12:23

Thank you, does this make more sense?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ID int(DIAGNOSISDATE TTEDATE_Dx ECHODATE) double HEARTVOL int(PACEDIAG PACEDATE) byte ABNLPACE int(ECGdatedx ECGDate) byte ABNLECG
2 18014    4 18018 22.46 1289 19303  0   24 18038 1
2 18014 1289 19303 25.46 1456 19470  1  441 18455 1
2 18014 2065 20079 26.82 2065 20079  1  756 18770 1
2 18014 2464 20478 27.66 3059 21073  1 1289 19303 1
2 18014 3213 21227 28.83 3213 21227  0 1456 19470 1
2 18014 3724 21738    27 3724 21738  0 2065 20079 1
2 18014    .     .     .    .     .  . 2464 20478 1
2 18014    .     .     .    .     .  . 2527 20541 1
2 18014    .     .     .    .     .  . 3059 21073 1
2 18014    .     .     .    .     .  . 3216 21230 1
2 18014    .     .     .    .     .  . 3724 21738 1
1 17174   21 17195 18.89 1764 18938  0  217 17391 1
1 17174 1291 18465 19.25 1852 19026  0  469 17643 1
1 17174 1680 18854 23.34 2072 19246  0  903 18077 1
1 17174 2111 19285 22.13 2086 19260  0 1107 18281 1
1 17174 2283 19457 21.15 2111 19285  0 1568 18742 1
1 17174 3003 20177 23.41 2123 19297  0 2122 19296 1
1 17174 3934 21108  19.5 2254 19428  0 2312 19486 1
1 17174    .     .     . 2282 19456 12 3003 20177 1
1 17174    .     .     . 2312 19486  0 3934 21108 1
1 17174    .     .     . 2327 19501  0    .     . .
1 17174    .     .     . 2416 19590  0    .     . .
1 17174    .     .     . 3934 21108  0    .     . .
end
format %tdnn/dd/CCYY DIAGNOSISDATE
format %tdnn/dd/CCYY ECHODATE
format %tdnn/dd/CCYY PACEDATE
format %tdnn/dd/CCYY ECGDate

here is a full description of the data

obs: 23
vars: 11 30 Sep 2019 11:19
--------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
ID byte %10.0g ID
DIAGNOSISDATE int %td.. DIAGNOSISDATE
TTEDATE_Dx int %10.0g TTEDATE_Dx
ECHODATE int %td.. ECHODATE
HEARTVOL double %14.2f HEARTVOL
PACEDIAG int %10.0g PACEDIAG
PACEDATE int %td.. PACEDATE
ABNLPACE byte %14.2f ABNLPACE
ECGdatedx int %14.2f ECG-date-dx
ECGDate int %td.. ECG-Date
ABNLECG byte %10.0g ABNLECG
--

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30071
#6

30 Sep 2019, 13:40

I've been working on this for a while and it's very confusing. What is going on with DIAGNOSISDATE and TTEDATE_Dx? It seems that the same patient can have multiple observations with the same DIAGNOSISDATE, and different TTEDATE_Dx values associated with them. That does not happen for the ECHO, PACE, or ECG variables: there we find only one observation of these for any given person on a given date. I have code that would work were it not for DIAGNOSISDATE and TTEDATE_Dx, but they break the code, and I don't understand how they would even fit into the final results you are looking for, so I am stymied at this point. Please show what the results you want to get would look like for, say, ID 1.
Comment
Brookline Cole

Join Date: Sep 2019

Posts: 11
#7

30 Sep 2019, 16:49

Thank you so much for writing back and for trying. I understand how this can be confusing.

So basically there are two patients in the dataset. ID 1 and 2 each patient was diagnosed with a special heart disease at a certain date (Diagnosis date). That's when our disease starts, then had several follow-up tests ECHO, PACE, and ECG at different time points (hence multiple rows). TTE means echo (sorry for not clarifying that) so TTEDATE-Dx is time from diagnosis of heart disease to the date echo was done in "days" and the same applies to PACEDIAG and ECGDATE-Dx (these are time from diagnosis to pacemaker check and time from diagnosis to ECGdate).

So before even transforming data from long to wide. I am thinking I need to create a balanced dataset. Number of rows for each patient will be number of unique procedure dates. For example if this dataset only included patient 2 with the following ECHOdates, ECG dates and pace dates:

ECHODATE

5/1/09

11/6/12

12/22/14

1/25/16

2/12/18

7/8/19

PACEDATE

11/6/12

4/22/13

12/22/14

9/11/17

2/12/18

7/8/19

ECG-Date

5/21/09

7/12/10

5/23/11

11/6/12

4/22/13

12/22/14

1/25/16

3/28/16

9/11/17

2/15/18

7/8/19

then I would have 13 unique follow up dates. the dataset would need 13 rows, and each row will represent a follow up date, then on the dates that echo was done we have the actual echo measurement (i.e. Heartvolume); and the days we have ECGs done, we can carry forward the previous echo measurement (HEARTVOL) and so on.

All follow up dates

5/1/09

5/21/09

7/12/10

5/23/11

11/6/12

4/22/13

12/22/14

1/25/16

3/28/16

9/11/17

2/12/18

2/15/18

7/8/19

Then once we do this I believe I can reshape data from long to wide. Does this make sense? Is there a better way to do this?

.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30071

30 Sep 2019, 19:00

Now I get it. The code below does what you want. I had to rename your variables so that they were more systematic and amenable to -reshape-ing. The decision to make them all lower case was just for my convenience, as that is the style I'm accustomed to. You don't need to do that part, although you will need to change the code to use mixed and upper case names if you don't do it.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ID int(DIAGNOSISDATE TTEDATE_Dx ECHODATE) double HEARTVOL int(PACEDIAG PACEDATE) byte ABNLPACE int(ECGdatedx ECGDate) byte ABNLECG
2 18014    4 18018 22.46 1289 19303  0   24 18038 1
2 18014 1289 19303 25.46 1456 19470  1  441 18455 1
2 18014 2065 20079 26.82 2065 20079  1  756 18770 1
2 18014 2464 20478 27.66 3059 21073  1 1289 19303 1
2 18014 3213 21227 28.83 3213 21227  0 1456 19470 1
2 18014 3724 21738    27 3724 21738  0 2065 20079 1
2 18014    .     .     .    .     .  . 2464 20478 1
2 18014    .     .     .    .     .  . 2527 20541 1
2 18014    .     .     .    .     .  . 3059 21073 1
2 18014    .     .     .    .     .  . 3216 21230 1
2 18014    .     .     .    .     .  . 3724 21738 1
1 17174   21 17195 18.89 1764 18938  0  217 17391 1
1 17174 1291 18465 19.25 1852 19026  0  469 17643 1
1 17174 1680 18854 23.34 2072 19246  0  903 18077 1
1 17174 2111 19285 22.13 2086 19260  0 1107 18281 1
1 17174 2283 19457 21.15 2111 19285  0 1568 18742 1
1 17174 3003 20177 23.41 2123 19297  0 2122 19296 1
1 17174 3934 21108  19.5 2254 19428  0 2312 19486 1
1 17174    .     .     . 2282 19456 12 3003 20177 1
1 17174    .     .     . 2312 19486  0 3934 21108 1
1 17174    .     .     . 2327 19501  0    .     . .
1 17174    .     .     . 2416 19590  0    .     . .
1 17174    .     .     . 3934 21108  0    .     . .
end
format %tdnn/dd/CCYY DIAGNOSISDATE
format %tdnn/dd/CCYY ECHODATE
format %tdnn/dd/CCYY PACEDATE
format %tdnn/dd/CCYY ECGDate
rename *, lower

//  MAKE THE VARIABLE NAMES SYSTEMATIC
rename heartvol echoresult
rename pacediag pacedresult
rename abnlpace paceabnl
rename abnlecg ecgabnl

drop ttedate_dx ecgdatedx

//  MAKE A LONG LAYOUT LIST OF ALL THE
//  PROCEDURES AND ENCOUNTERS
gen long obs_no = _n
reshape long @date @result @abnl, i(obs_no) j(test) string
drop if missing(date)

//  FILL IN ALL COMBINATIONS OF TEST AND DATE FOR EACH ID
preserve
keep test date
duplicates drop
tempfile framework
save `framework'
restore
merge m:1 test date using `framework', assert(match using) nogenerate
drop obs_no
duplicates drop

//  SPREAD DIAGNOSIS DATE TO EVERY OBSERVATION OF THE ID
by id, sort: egen diagnosis_date = max(cond(test == "diagnosis", date, .))
format date diagnosis_date %tdnn/dd/CCYY
drop if test == "diagnosis"

//  NOW GO WIDE: ONE OBS PER ID#DATE
reshape wide @result @abnl, i(id date) j(test) string

//  CLEAN UP
order diagnosis_date, after(id)
drop ecgresult echoabnl paceresult
//  REPLACE MISSING RESULTS OF A TEST BY
//  CARRYING FORWARD THE LAST OBSERVATION
foreach v of varlist ecgabnl echoresult paceabnl {
    by id (date), sort: replace `v' = `v'[_n-1] if missing(`v')
}
gen days_from_diagnosis = date - diagnosis_date
rename echoresult heartvolume

Comment

Brookline Cole

Join Date: Sep 2019

Posts: 11
#9

01 Oct 2019, 14:06

Thank you very much for all your help! I was trying to figure this out for days!! really appreciate it!
Comment

Announcement

Long to Wide for multiple time varying covariates*

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment